Abstract — We construct a channel coding scheme to achieve 
the capacity of any discrete memoryless channel based solely 
on the techniques of polar coding. In particular, we show how 
source polarization and randomness extraction via polarization 
can be employed to "shape" uniformly-distributed i.i.d. random 
variables into approximate i.i.d. random variables distributed ac- 
cording to the capacity-achieving distribution. We then combine 
this shaper with a variant of polar channel coding, constructed by 
the duality with source coding, to achieve the channel capacity. 
Our scheme inherits the low complexity encoder and decoder 
of polar coding. It differs conceptually from Gallager's method 
for achieving capacity, and we discuss the advantages and 
disadvantages of the two schemes. An application to the AWGN 
channel is discussed. 

Index Terms — Capacity-achieving codes, channel polarization, 
polar codes, randomness extraction, source polarization 

I. Introduction 

Polar codes, introduced by Arıkan [ ], are the first set 
of codes that provably achieve the symmetric capacity 1 
of any discrete memoryless channel (DMC) [2], using encod- 
ing and decoding algorithms whose complexity is essentially 
linear in the blocklength N. 2 By now, the polarization phe- 
nomenon at the heart of polar coding has been adapted for use 
in a variety of information-processing tasks. 

Being a family of linear codes, polar codes do not achieve 
the true channel capacity whenever the optimum input distribu- 
tion is not uniform, which is generically the case for arbitrary 
DMCs. As noted in [ ], Gallager's method [ , p.208] of "shap- 
ing" blocks of independent uniformly-distributed encoded 
message bits into (a rational approximation to) an arbitrary 
distribution of a channel input symbol can be combined with 
polar coding to approach the channel capacity. The shaper 
essentially creates a super-channel whose optimal input dis- 
tribution is uniform, so that concatenating the usual multi-bit 
polar encoder with the shaper results in an encoder suitable for 
approaching capacity. The overhead of the shaper complicates 
the encoding and decoding algorithms, though it does not affect 
the scaling of the complexity in the blocklength for fixed 
accuracy in approximating the non-uniform distribution. 

Here we use the techniques of polar coding to give a 
more information-theoretic shaper construction and exhibit a 
modified family of polar codes which can achieve the capacity 
of any DMC. Instead of approximating a single input-bit, our 
shaper approximates a string of i.i.d. input-bits. Compared to 
Gallager's method, this leads to a conceptually different coding 
scheme having better encoding and decoding complexity. (See 
Section VIII for a comparison of the methods.) 

The idea of our shaper is to run a randomness extractor for 
the optimal input distribution in reverse, a technique previously 
exploited by two of us to construct capacity-achieving codes 
in the context of one-shot channel coding [4]. As in [4], we 
construct the outer polar code 3 by exploiting the duality be- 
tween channel coding and source coding with side information, 
detailed for polar coding in [ ]. 

To understand the main idea more concretely, suppose that $W : \mathcal{X} \to \mathcal{Y}$ denotes a DMC with binary input alphabet $\mathcal{X} = \{0, 1\}$, arbitrary output alphabet $\mathcal{Y}$, and transition probabilities $W(y|x)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$. $W^L$ denotes the channel corresponding to $L$ uses of $W$. We consider binary-input DMCs only for convenience; the techniques of [ ] and [ ] can be used to generalize the scheme to DMCs with arbitrary input size. Furthermore, let Bernoulli($p$) for $p \in [0, 1]$ be the capacity-achieving input distribution, so that $I(X:Y) = C(W)$ for $X \sim$ Bernoulli($p$) and $Y = W(X)$. Given $L$ i.i.d. instances of $X$, roughly $H(X^L) = L H_b(p)$ approximately-uniformly distributed bits can be extracted, where $H_b$ denotes the binary entropy [ ]. Heuristically, we may thus hope to simulate $X^L$ by inputting $L H_b(p)$ uniform bits to the inverse of the extractor.
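To illustrate why a non-uniform input can be needed, the following minimal sketch finds the capacity-achieving Bernoulli($p$) of a binary-input channel by grid search and evaluates the $L H_b(p)$ uniform bits that the heuristic above would feed to the inverted extractor. The Z-channel with crossover probability 0.3, the grid resolution, and all function names are assumptions chosen for illustration; none of them come from the paper.

```python
import math

def h_b(p):
    """Binary entropy H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(p, W):
    """I(X:Y) in bits for X ~ Bernoulli(p), read here as Pr[X = 1] = p.

    W[x][y] is the transition probability W(y|x) of a binary-input DMC.
    """
    px = [1 - p, p]
    ny = len(W[0])
    py = [sum(px[x] * W[x][y] for x in range(2)) for y in range(ny)]
    total = 0.0
    for x in range(2):
        for y in range(ny):
            joint = px[x] * W[x][y]
            if joint > 0:
                total += joint * math.log2(W[x][y] / py[y])
    return total

def capacity_achieving_p(W, steps=10000):
    """Grid search for the Bernoulli parameter that maximizes I(X:Y)."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda p: mutual_information(p, W))

# Hypothetical example channel: a Z-channel that flips input 1 to 0 w.p. 0.3.
Z_CHANNEL = [[1.0, 0.0], [0.3, 0.7]]

p_opt = capacity_achieving_p(Z_CHANNEL)
capacity = mutual_information(p_opt, Z_CHANNEL)            # C(W)
symmetric_capacity = mutual_information(0.5, Z_CHANNEL)    # uniform-input rate
L = 1024
uniform_bits_needed = L * h_b(p_opt)   # roughly L * H_b(p) bits drive the shaper
```

For this example channel the optimal input is visibly biased away from 1/2, so a code restricted to uniform inputs operates strictly below capacity.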

Given X L , an extractor function may be stochastically run 
in reverse by making use of the joint distribution of its inputs 
and outputs. Given an extractor output value, an input value 
is chosen randomly among the preimages according to the 
conditional distribution induced from the joint distribution by 
fixing the output value. However, it is not clear this process 
can be done efficiently for arbitrary input distributions. 

3 The outer polar code is the code for the super-channel.

Luckily, this process is efficient for extractors based on the source polarization phenomenon. A polarization extractor for $X^L$ simply generates $U^L = X^L G_L$ (when $L = 2^\ell$ for $\ell \in \mathbb{Z}^+$) using the channel transform $G_L$, the $\ell$-fold Kronecker power of the $2 \times 2$ polarization kernel, and keeps only those $U_i$ such that $H(U_i | U^{i-1}) \geq 1 - \epsilon$ for some specified $\epsilon$. Polarization ensures that there will be roughly $L H_b(p)$ such $U_i$. 4 To invert this extractor, we first build up a vector $\tilde{U}^L$ by filling with uniformly-distributed input the positions $i$ for which $U_i | U^{i-1}$ has entropy at least $1 - \epsilon$ and stochastically generating the remaining positions using the distributions of the $U_i | U^{i-1}$. The output $\tilde{X}^L$ is just $\tilde{X}^L = \tilde{U}^L G_L$ and, for $\epsilon$ small, closely approximates $X^L$. 5 The necessary distributions can be efficiently computed, a feature used in the similarly-constructed decompressor of polar source coding [ ].
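The polarization of the conditional entropies can be observed directly for small blocklengths. The sketch below computes $H(U_i | U^{i-1})$ by brute-force enumeration and selects the extractor positions. It assumes Arıkan's kernel [[1, 0], [1, 1]] without bit-reversal, which is a choice made here for illustration; the function names are likewise hypothetical, and the exponential-time enumeration is only meant for tiny $L$.

```python
import itertools
import math

def polar_transform(x):
    """Return u = x G_L over GF(2), with G_L the Kronecker power of [[1,0],[1,1]].

    The kernel is an assumption of this sketch (no bit-reversal permutation).
    """
    u = list(x)
    n = len(u)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                u[j] ^= u[j + step]
        step *= 2
    return u

def conditional_entropies(p, L):
    """H(U_i | U^{i-1}) in bits for i = 1..L, with X^L i.i.d. Bernoulli(p)."""
    # G_L is a bijection, so P(u) = P(x) whenever u = x G_L.
    prob_u = {}
    for x in itertools.product((0, 1), repeat=L):
        ones = sum(x)
        prob_u[tuple(polar_transform(x))] = p ** ones * (1 - p) ** (L - ones)
    # Entropies of all prefixes U^i, then the chain rule.
    prefix_entropy = [0.0]
    for i in range(1, L + 1):
        marginal = {}
        for u, pr in prob_u.items():
            marginal[u[:i]] = marginal.get(u[:i], 0.0) + pr
        prefix_entropy.append(
            -sum(q * math.log2(q) for q in marginal.values() if q > 0))
    return [prefix_entropy[i] - prefix_entropy[i - 1] for i in range(1, L + 1)]

def extractor_indices(p, L, K):
    """Ordered 1-based positions of the K largest H(U_i | U^{i-1})."""
    H = conditional_entropies(p, L)
    best = sorted(range(L), key=lambda i: H[i], reverse=True)[:K]
    return sorted(i + 1 for i in best)
```

Calling `conditional_entropies(0.11, 8)` shows some entries moving towards 1 and others towards 0, while their sum stays equal to $8 H_b(0.11)$ by the chain rule.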

Combining the shaper with the channel $W^L$ creates a super-channel $W'_{K,L}$, to which the usual polar coding techniques could be applied. However, this does not result in an efficient coding scheme because the likelihoods and Bhattacharyya parameters of $W'_{K,L}$ are not necessarily easy to compute. To regain efficiency, we instead employ a polar coding scheme adapted from the source compression scheme for $U^L$ given $Y^L$ at the decompressor. Due to its i.i.d. structure, the necessary parameters can be efficiently computed, meaning that the complexity of the resulting decoder will again be essentially linear in the number of uses of the channel $W$.

This paper is structured as follows. In Section II we define 
the shaper and super-channel precisely. Section III details 
our coding scheme, Section IV shows that it achieves the 
capacity of any binary-input DMC, and Section V shows 
that it is reliable. Section VI then describes how encod- 
ing, decoding, and channel construction can be performed 
efficiently. Section VII demonstrates that the shaper can be 
almost completely derandomized without impacting the code 
performance. Section VIII explains the differences between the 
new scheme and Gallager's method. Finally in Section IX we 
discuss some possible modifications of the new scheme as well 
as some potential applications, in particular communication 
over the AWGN channel with an average power constraint. 

II. Polarization-Based Shaper and Super-Channel 

We briefly recount the use of source polarization in randomness extraction [5], [10], [11] and then formulate the shaper and super-channel. First it is convenient to introduce the following notation. Let $[k] = \{1, \dots, k\}$. For $x \in \mathbb{F}_2^k$ and $\mathcal{I} \subseteq [k]$ we have $x[\mathcal{I}] = [x_i : i \in \mathcal{I}]$ and $x_i^j = [x_i, \dots, x_j]$. For an ordered set of distinct elements $\mathcal{A} \subseteq [k]$ and $a \in \mathcal{A}$, $\mathrm{pos}_{\mathcal{A}}(a)$ denotes the position of the entry $a$ in $\mathcal{A}$.

As described above, a $K$-bit polarization extractor $E_{L,K}$ for $X^L$ simply outputs the $K$ bits of $U^L = X^L G_L$ for which $H(U_i | U^{i-1})$ are greatest. We denote this (ordered) set of indices by $\mathcal{E}_K$ and the output of the extractor by $U^L[\mathcal{E}_K]$.

The aim of randomness extraction is to output $K$ approximately uniform bits, where the approximation is quantified using the variational distance. Recall that for distributions $P$ and $Q$ over the same alphabet $\mathcal{X}$, the variational distance is defined by $\delta(P, Q) := \frac{1}{2} \sum_{x \in \mathcal{X}} |P(x) - Q(x)|$. We will often abuse notation slightly and write a random variable instead of its distribution in $\delta$.

Using $\mathcal{E}_K$ we define the shaper for $X^L$ as follows.
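For concreteness, the variational distance defined above can be evaluated as in the following minimal sketch, which represents distributions as dictionaries from outcomes to probabilities. This representation and the helper names are assumptions made for illustration.

```python
def variational_distance(P, Q):
    """delta(P, Q) = 1/2 * sum_x |P(x) - Q(x)| for distributions given as dicts."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

def bernoulli(p):
    """Distribution of a bit that equals 1 with probability p."""
    return {0: 1 - p, 1: p}

# Two Bernoulli distributions differ by exactly the gap of their parameters.
example = variational_distance(bernoulli(0.3), bernoulli(0.5))
```

The distance is 0 for identical distributions and 1 for distributions with disjoint supports.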

4 Note that this is not a randomness extractor in the usual sense, which 
is designed to work for any input distribution of sufficiently high min- 
entropy [8]. 

5 Korada and Urbanke apply a similar construction, which they called 
randomized rounding, to the problem of lossy source coding in [9].
Fig. 1: Polarization-based randomness extractor $E_{L,K}$. The input $X^L$ is first transformed to $U^L$ via the polarization transformation $G_L$, and subsequently $F_{L,K}$ filters out the $K$ bits of $U^L$ for which $H(U_i | U^{i-1})$ are greatest.



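Definition 1. The shaper $S_{K,L}$ for $X^L$ is the map $S_{K,L} : U^K \to \tilde{X}^L$ taking input $U^K$ to $\tilde{X}^L = \tilde{U}^L G_L$, with

$$\tilde{U}_i = \begin{cases} (U^K)_{\mathrm{pos}_{\mathcal{E}_K}(i)} & \text{if } i \in \mathcal{E}_K, \\ Z_i & \text{otherwise.} \end{cases} \tag{1}$$

Here $Z_i$ is a random variable generated from the distribution of $U_i | U^{i-1} = \tilde{U}^{i-1}$, using $U^L = X^L G_L$.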
Fig. 2: Generation of an approximation to $X^L$ from a uniform input $U^K$ using the shaper $S_{K,L}$. $\tilde{U}^L$ is first constructed by $R_{K,L}$ from the uniform input according to (1). Applying $G_L$ gives $\tilde{X}^L$, which has nearly the same distribution as $X^L$.

Using the shaper with uniform input $U^K$ (a $K$-bit vector whose entries are i.i.d. Bernoulli($\frac{1}{2}$)) generates an approximation $\tilde{X}^L := S_{K,L}(U^K)$ to $X^L$ (see also [9, Lemma 11]).

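Lemma 1. For $\epsilon \geq 0$ and $K$ such that $H(U_i | U^{i-1}) \geq 1 - \epsilon$ for all $i \in \mathcal{E}_K$,

$$\delta(\tilde{X}^L, X^L) \leq K \sqrt{\frac{\ln 2}{2}\, \epsilon}.$$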



Proof: Let $\tilde{U}^L$ be the $L$-bit string obtained when using the shaper with uniform input $U^K$ (cf. Eq. (1)). We have $\tilde{X}^L = \tilde{U}^L G_L$ and $X^L = U^L G_L$ and, hence, since $G_L$ is invertible,

$$\delta(\tilde{X}^L, X^L) = \delta(\tilde{U}^L, U^L). \tag{2}$$

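We will bound the distance on the right hand side. For this, we introduce a family of intermediate distributions $P_{U_1 \cdots U_i \tilde{U}_{i+1} \cdots \tilde{U}_L}$, for $i = 0, \dots, L$, defined by

$$P_{U_1 \cdots U_i \tilde{U}_{i+1} \cdots \tilde{U}_L} := P_{U_1 \cdots U_i}\, P_{\tilde{U}_{i+1} \cdots \tilde{U}_L | \tilde{U}_1 \cdots \tilde{U}_i},$$

so that $i = 0$ gives $P_{\tilde{U}^L}$ and $i = L$ gives $P_{U^L}$. By the triangle inequality,

$$\delta(\tilde{U}^L, U^L) \leq \sum_{i=1}^{L} \delta\big(P_{U_1 \cdots U_{i-1} \tilde{U}_i \cdots \tilde{U}_L},\, P_{U_1 \cdots U_i \tilde{U}_{i+1} \cdots \tilde{U}_L}\big) \leq \sum_{i=1}^{L} \delta\big(P_{U_1 \cdots U_{i-1} \tilde{U}_i},\, P_{U_1 \cdots U_i}\big), \tag{5}$$

where the last line follows from the fact that the variational distance is non-increasing under stochastic maps [12] (we apply this to the map that generates $\tilde{U}_{i+1} \cdots \tilde{U}_L$ according to the distribution $P_{\tilde{U}_{i+1} \cdots \tilde{U}_L | \tilde{U}_1 \cdots \tilde{U}_i}$). Each term of the sum can be written as $\delta(P_{U^{i-1}} P_{\tilde{U}_i | \tilde{U}^{i-1}},\, P_{U^{i-1}} P_{U_i | U^{i-1}})$ or, equivalently, $\mathbb{E}_{U^{i-1}}\, \delta(P_{\tilde{U}_i | \tilde{U}^{i-1}},\, P_{U_i | U^{i-1}})$. To bound this, we use Pinsker's inequality [13, p.58] as well as the concavity of the square root,

$$\mathbb{E}_{U^{i-1}}\, \delta\big(P_{\tilde{U}_i | \tilde{U}^{i-1}},\, P_{U_i | U^{i-1}}\big) \leq \mathbb{E}_{U^{i-1}} \sqrt{\frac{\ln 2}{2}\, D\big(P_{U_i | U^{i-1}} \,\big\|\, P_{\tilde{U}_i | \tilde{U}^{i-1}}\big)} \leq \sqrt{\frac{\ln 2}{2}\, \mathbb{E}_{U^{i-1}} D\big(P_{U_i | U^{i-1}} \,\big\|\, P_{\tilde{U}_i | \tilde{U}^{i-1}}\big)}.$$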
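By construction, the conditional distribution of $\tilde{U}_i$ for all $i \in \mathcal{E}_K$ is the uniform distribution, so that the relative entropy $D(\cdot \| \cdot)$ (in bits) appearing in Pinsker's inequality satisfies

$$\mathbb{E}_{U^{i-1}} D\big(P_{U_i | U^{i-1}} \,\big\|\, P_{\tilde{U}_i | \tilde{U}^{i-1}}\big) = 1 - H(U_i | U^{i-1}).$$

Furthermore, for all $i \notin \mathcal{E}_K$ the conditional distribution of $\tilde{U}_i$ equals $P_{U_i | U^{i-1}}$, so that the corresponding term in the sum (5) vanishes. The sum can thus be rewritten as

$$\delta(\tilde{U}^L, U^L) \leq \sum_{i \in \mathcal{E}_K} \sqrt{\frac{\ln 2}{2} \big(1 - H(U_i | U^{i-1})\big)}, \tag{10}$$

from which the assertion follows.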
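As a numerical sanity check of Lemma 1, the sketch below computes the exact distance between the shaper's internal vector and $U^L$ for a small blocklength and compares it with $K \sqrt{\epsilon \ln 2 / 2}$. It assumes the kernel [[1, 0], [1, 1]] without bit-reversal and takes $\epsilon$ as the smallest value compatible with the chosen index set; these choices and all names are illustrative assumptions, and the enumeration is exponential in $L$.

```python
import itertools
import math

def polar_transform(x):
    """u = x G_L over GF(2) for the (assumed) kernel [[1,0],[1,1]]."""
    u = list(x)
    step = 1
    while step < len(u):
        for start in range(0, len(u), 2 * step):
            for j in range(start, start + step):
                u[j] ^= u[j + step]
        step *= 2
    return u

def u_distribution(p, L):
    """Exact distribution of U^L = X^L G_L for X^L i.i.d. Bernoulli(p)."""
    dist = {}
    for x in itertools.product((0, 1), repeat=L):
        ones = sum(x)
        dist[tuple(polar_transform(x))] = p ** ones * (1 - p) ** (L - ones)
    return dist

def prefix_marginal(dist, i):
    """Marginal distribution of the first i components."""
    out = {}
    for u, pr in dist.items():
        out[u[:i]] = out.get(u[:i], 0.0) + pr
    return out

def shaper_check(p, L, K):
    """Return (exact distance, Lemma 1 bound) for the K highest-entropy positions."""
    P = u_distribution(p, L)
    marg = [prefix_marginal(P, i) for i in range(L + 1)]
    ent = [-sum(q * math.log2(q) for q in m.values() if q > 0) for m in marg]
    H = [ent[i + 1] - ent[i] for i in range(L)]          # H(U_i | U^{i-1})
    E_K = set(sorted(range(L), key=lambda i: H[i], reverse=True)[:K])
    eps = max(0.0, 1 - min(H[i] for i in E_K))
    # Law of the shaper's vector: uniform on E_K, true conditionals elsewhere.
    delta = 0.0
    for u, pu in P.items():
        q = 1.0
        for i in range(L):
            if i in E_K:
                q *= 0.5
            else:
                q *= marg[i + 1][u[:i + 1]] / marg[i][u[:i]]
        delta += abs(q - pu)
    delta *= 0.5
    bound = K * math.sqrt(math.log(2) / 2 * eps)
    return delta, bound
```

For such small $L$ the bound is loose, since polarization has barely set in; the point of the check is only that the computed distance never exceeds it.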
Fig. 3: The super-channel $W'_{K,L} := W^L \circ S_{K,L}$, shown here acting on the uniformly-random input $U^K$, which results in $\tilde{Y}^L$.



Concatenating the shaper with the channel gives the super-channel $W'_{K,L} := W^L \circ S_{K,L}$. Monotonicity of the variational distance gives the following lemma, which is the basis of our coding scheme. Letting $\tilde{Y}^L := W^L(\tilde{X}^L)$ and $Y^L = W^L(X^L)$, we have

Lemma 2. For $\epsilon \geq 0$ and $K$ such that $H(U_i | U^{i-1}) \geq 1 - \epsilon$ for all $i \in \mathcal{E}_K$,

$$\delta\big((U^K, \tilde{Y}^L),\, (U^L[\mathcal{E}_K], Y^L)\big) \leq K \sqrt{\frac{\ln 2}{2}\, \epsilon}.$$

Proof: Write $\epsilon' := K \sqrt{\epsilon \ln 2 / 2}$ for the bound of Lemma 1. Lemma 1 implies $\delta((\tilde{X}^L, \tilde{Y}^L), (X^L, Y^L)) \leq \epsilon'$ by the monotonicity of the variational distance under stochastic maps. Applying $G_L$ to $\tilde{X}^L$ or $X^L$ and marginalizing over the elements not in $\mathcal{E}_K$ is also a stochastic map, so $\delta((\tilde{U}^L[\mathcal{E}_K], \tilde{Y}^L), (U^L[\mathcal{E}_K], Y^L)) \leq \epsilon'$. Observing that $\tilde{U}^L[\mathcal{E}_K] = U^K$ completes the proof. ■

III. Coding Scheme 

As in Gallager's original approach, our coding scheme is based on concatenating an outer coding layer for reliable transmission through the super-channel with an inner shaping layer to realize $W'_{K,L}$. In principle, polar codes may be employed for this purpose, using the multilevel coding described in [2, Section III.B]. 6 There, a channel with multiple input bits (assumed to be uniformly distributed) is decomposed into a sequence of binary-input channels and usual polar coding is applied to each. In the present context, the $j$th such channel $W'^{(j)}_{K,L}$ maps $U_j$ to $(W'_{K,L}(U^K), U^{j-1})$. Letting $M$ be the number of super-channel uses, the overall blocklength is then $N := ML$. Figure 4 depicts the case $M = 2$, $L = 4$, and $K = 2$.

Fig. 4: The coding scheme for $L = 4$, $M = 2$ and $K = 2$. At the outer layer, polar codes are used to provide reliable communication over the super-channel $W'_{K,L}$, by using the multilevel coding method to treat it as a sequence of binary input channels. The encoder and decoder are constructed from the compressor and decompressor for the task of compressing $U^L[\mathcal{R}_\epsilon]$ relative to side information $Y^L$ at the decoder; in particular, the frozen input bits correspond to the compressor outputs. Here $U^L[\mathcal{E}_K]$ is the output of the polarization-based randomness extractor applied to the random variable $X^L$, which has the optimal distribution for achieving the capacity of the physical channel $W$, and $Y^L$ is the corresponding channel output. At the inner layer, polarization is again used to shape the uniform inputs from the outer layer into a good approximation to $X^L$ for transmission over $W$.

However, to apply the polar coding construction we would 
need to know both the output Bhattacharyya parameters (for 
code construction) and input likelihood ratios (for decoding) 
of each $W'^{(j)}_{K,L}$. These might not be efficiently computable from 
the properties of W itself, as the shaper output is not precisely 
X L . Instead, we will use the close relationship between 
channel coding and source coding with side information [5], 
[4] to construct a reliable and efficient scheme. 

6 This type of multilevel coding is due to Imai and Hirakawa [14].

Consider the general problem of compressing a uniformly-distributed bit $U$ relative to arbitrary side information $Y$, where $Y = W(U)$ for some channel $W$. Suppose that we have a compressor / decompressor pair $(C, D)$ such that $U^M$ can be reconstructed from $Y^M$ and the compressor output $C(U^M)$ with probability $1 - P_{\mathrm{err}}$, i.e. $\Pr[U^M \neq D(Y^M, C(U^M))] = P_{\mathrm{err}}$. Each compressor output $c$ defines a set of codewords: all the values of $u^M$ for which $C(u^M) = c$. Choosing a compressor output at random, encoding messages into the associated codewords, and decoding them with the decompressor $D$ then leads to a block error probability (averaged over uniformly-chosen input messages and codebooks) of $P_{\mathrm{err}}$ [4, Lemma 2]. 7

Therefore, in order to construct an efficient and reliable coding scheme for the super-channel, we look for an efficient and reliable compression scheme for $U^K$ relative to $\tilde{Y}^L$. Due to Lemma 2, any compression scheme for $U^L[\mathcal{E}_K]$ relative to $Y^L$ will only incur a negligible additional probability of error when applied to $(U^K, \tilde{Y}^L)$ (cf. Theorem 3). Polar coding provides such an efficient and reliable scheme. Thus, by assuming the model $(U^L[\mathcal{E}_K], Y^L)$ instead of the true parameters $(U^K, \tilde{Y}^L)$, the super-channel decompressor benefits from the independence of $X^L$ for efficient decompression while incurring negligible error overhead.

To be more precise, let $V_i$ be the $i$th bit of $U^L[\mathcal{E}_K]$. Given $M$ copies of every $V_i$, we can use standard polar source coding on each of these sequences in turn to compress $V_i$ relative to the side information $Y^L V^{i-1}$. The compressor outputs those bits of $T^{(i)} = V_i^M G_M$ for which $H\big(T^{(i)}_j \,\big|\, (Y^L V^{i-1})^M,\, T^{(i)}_1 \cdots T^{(i)}_{j-1}\big)$ exceeds some fixed threshold $\epsilon$; call this set $\mathcal{C}_\epsilon$. The Bhattacharyya parameters and likelihood ratios associated with $(V_i, Y^L V^{i-1})$, necessary to determine $\mathcal{C}_\epsilon$ and to construct the decoder, are precisely those computed in the polar source coding scheme of $X$ relative to side information $Y$.

To turn this into channel coding, we simply fix (freeze) the value of the bits in $\mathcal{C}_\epsilon$, use the bits in the complement $\mathcal{C}_\epsilon^c$ as data bits, and map messages to codewords by applying $G_M$. The values taken by the frozen bits are known to the decoder and one can use the source coding decompressor to decode the associated $W'^{(i)}_{K,L}$ channel input. Note that the $V_i$ must be decoded in order, as $V_i$ is part of the channel output for all subsequent channels.

For each $i$ the above scheme operates at a rate of $1 - H(V_i | Y^L V^{i-1})$, yielding a total rate per $W'_{K,L}$ use of

$$\sum_{i=1}^{K} \big[1 - H(V_i | Y^L V^{i-1})\big] = K - H(U^L[\mathcal{E}_K] | Y^L). \tag{11}$$

Dividing this rate by $L$ then gives the rate per use of $W$,

$$R := \lim_{L \to \infty} \frac{1}{L} \big[|\mathcal{E}_K| - H(U^L[\mathcal{E}_K] | Y^L)\big]. \tag{12}$$

IV. Achieving Capacity 

We now show that a suitable choice of $K$ enables our scheme to achieve the capacity of the physical channel $W$. To do so we make use of the polarization property of the $U_i | U^{i-1}$ for a given $X^L$. Consider the two (ordered) sets

$$\mathcal{R}_\epsilon := \{i \in [L] : H(U_i | U^{i-1}) \geq 1 - \epsilon\} \quad \text{and} \tag{13}$$

$$\mathcal{D}_\epsilon := \{i \in [L] : H(U_i | U^{i-1}) \leq \epsilon\} \tag{14}$$

of essentially random and deterministic variables, respectively. From Theorems 1 and 2 of [ ] we have $|\mathcal{R}_\epsilon| = L H_b(p) - o(L)$ and $|\mathcal{D}_\epsilon| = L(1 - H_b(p)) - o(L)$ with $\epsilon = O(2^{-L^\beta})$ for $\beta < \frac{1}{2}$.

7 Note that transforming this code into one with small worst-case error 
probability would still require an expurgation argument. 



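As an aside, observe that choosing $\mathcal{E}_K = \mathcal{R}_\epsilon$ with $K = |\mathcal{R}_\epsilon|$ yields a good shaper by Lemma 1, which gives the following

Theorem 1. $\delta\big(S_{|\mathcal{R}_\epsilon|,L}(U^{|\mathcal{R}_\epsilon|}),\, X^L\big) = O\big(L\, 2^{-\frac{1}{2} L^\beta}\big)$ for $\beta < \frac{1}{2}$.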

It is simple to show that the coding scheme achieves C(W). 

Theorem 2. R = C(W). 

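Proof: Applying the chain rule to $H(U^L | Y^L)$ gives

$$H(U^L | Y^L) = H(U^L[\mathcal{R}_\epsilon] | Y^L) + H(U^L[\mathcal{R}_\epsilon^c] | Y^L U^L[\mathcal{R}_\epsilon]) \geq H(U^L[\mathcal{R}_\epsilon] | Y^L), \tag{15}$$

where $\mathcal{R}_\epsilon^c$ is the complement of $\mathcal{R}_\epsilon$ in $[L]$. Since $H(U^L | Y^L) = H(X^L | Y^L) = L H(X|Y)$ and $H(X) = H_b(p)$, by (11) and the properties of $\mathcal{R}_\epsilon$ we find

$$R \geq \lim_{L \to \infty} \frac{1}{L} \big[L H_b(p) - o(L) - L H(X|Y)\big] = C(W). \tag{16}$$

As $R$ cannot exceed the capacity, we have $R = C(W)$. ■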

V. Reliability 

In this section we analyze the reliability of the coding 
scheme, starting with a general lemma on the reliability of 
using the "wrong" compressor / decompressor pair in the 
problem of source coding. 

Lemma 3. Let $X$ and $X'$ be arbitrary random variables such that $\delta(X', X) \leq \eta$ and let $W$ denote an arbitrary stochastic map. If $C$ and $D$ are a compressor / decompressor pair for $(X, W(X))$, such that $\Pr[X \neq \hat{X}] \leq \eta'$ where $\hat{X} = D(W(X), C(X))$, then, for $\hat{X}' = D(W(X'), C(X'))$,

$$\Pr[X' \neq \hat{X}'] \leq \eta + \eta'.$$

Proof: Note that the pairs $(X, \hat{X})$ and $(X', \hat{X}')$ are obtained from $X$ and $X'$ by applying the stochastic map that takes $x$ to $(x, D(W(x), C(x)))$. Because the variational distance is non-increasing under such maps, we have

$$\delta\big((X, \hat{X}), (X', \hat{X}')\big) \leq \delta(X, X') \leq \eta. \tag{17}$$

Furthermore, defining $(X, \bar{X})$ to be the random variable $(X, X)$, with distribution $P_{X\bar{X}}(x, \bar{x}) = P_X(x)\, \delta_{x\bar{x}}$, we have

$$\delta\big((X, \hat{X}), (X, \bar{X})\big) = \Pr[X \neq \hat{X}] \leq \eta'. \tag{18}$$

Hence, applying the triangle inequality, we obtain

$$\delta\big((X, \bar{X}), (X', \hat{X}')\big) \leq \eta + \eta'. \tag{19}$$

Now note that the variational distance can also be written as

$$\delta(A, A') = \sum_{a :\, P_A(a) \leq P_{A'}(a)} \big[P_{A'}(a) - P_A(a)\big]. \tag{20}$$

Applied to $A = (X, \bar{X})$ and $A' = (X', \hat{X}')$, and using that $P_{X\bar{X}}(x, \bar{x}) = 0$ for $x \neq \bar{x}$, we immediately obtain

$$\delta\big((X, \bar{X}), (X', \hat{X}')\big) \geq \sum_{x \neq \hat{x}} P_{X'\hat{X}'}(x, \hat{x}), \tag{21}$$

which implies that $\Pr[X' \neq \hat{X}'] \leq \eta + \eta'$. ■

Next we analyze the reliability of the multilevel coder. Suppose we would like to compress ($L$ instances of) $(V_1, \dots, V_n)$ relative to side information $Y$, by sequentially compressing $V_i$ relative to $V^{i-1} Y$. Define $\hat{V}$ to be the output of the decompressor, let $A_i$ be the event that $\hat{V}_i \neq V_i$ (i.e. that the decompressor makes a mistake at position $i$), and let $B_i := \bigcup_{k=1}^{i} A_k$. Note that $\Pr[B_n]$ is the probability of incorrectly decoding at least one $V_i$ for $i \in [n]$. Let $r$ be a bound on the probability that we decode incorrectly at any step and that the previous steps are all correct: $\Pr[A_j \cap B_{j-1}^c] \leq r$ for all $j \in [n]$. Then

Lemma 4. For $n \in \mathbb{Z}^+$ and $r$ as defined above, we have

$$\Pr[B_n] \leq nr. \tag{22}$$

Proof: The proof proceeds by induction over $n$; the case $n = 1$ holds by assumption. The induction step is as follows:

$$\Pr[B_{n+1}] = \Pr[B_n \cup A_{n+1}] \tag{23}$$
$$= \Pr[B_n] + \Pr[A_{n+1} \cap B_n^c] \tag{24}$$
$$\leq \Pr[B_n] + r \tag{25}$$
$$\leq (n + 1) r, \tag{26}$$

where (25) follows by assumption and (26) uses the induction hypothesis. ■
Now the statement of reliability follows easily. 



Theorem 3. The error probability of the coding scheme satisfies $P_{\mathrm{err}} = O\big(L\, 2^{-M^\beta} + L\, 2^{-\frac{1}{2} L^{\beta'}}\big)$ for $\beta, \beta' < \frac{1}{2}$.

Proof: For the polar source coding scheme, note that $\Pr[A_i \cap B_{i-1}^c] + x \in O(2^{-M^\beta})$, where $x$ is the probability that $\hat{V}_i \neq V_i$ given that a mistake previously occurred, but where we still give the correct $V^{i-1}$ to the decompressor. We can therefore upper bound $r$ in Lemma 4 by $O(2^{-M^\beta})$. Thus, the probability of incorrectly decoding any of the $|\mathcal{R}_\epsilon|$ $V_i$ is $O(L\, 2^{-M^\beta})$; this is $\eta'$ in Lemma 3. Lemma 2 and the properties of $\mathcal{R}_\epsilon$ give $\eta = O\big(L\, 2^{-\frac{1}{2} L^{\beta'}}\big)$ for $\beta' < \frac{1}{2}$, establishing the theorem. ■



VI. Efficiency 

Here we consider the encoding, decoding, and construction complexity of the coding scheme. Construction of the codes presented in Section III requires the random set $\mathcal{R}_\epsilon$ for the shaper at the inner layer, and the deterministic sets (the $\mathcal{D}_\epsilon$) for the tasks of compressing $V_i$ relative to side information $Y^L V^{i-1}$ to determine the frozen bits at the outer layer. In principle, these sets could be constructed by simulation, as in [ ]. More satisfying would be a linear-time algorithm along the lines of [15], [16] for the source coding problem in which the variable to be compressed is not uniformly-distributed. Presumably that algorithm can be adapted to the problem of finding the frozen bits at the outer layer, as the compressor actually used in Section III is for an almost uniformly-distributed random variable (cf. Lemma 2). The complexity of constructing the outer layer would then be $O(N)$, where $N = ML$.

Proposition 1. The encoder has complexity $O(N \log N)$.

Proof: The encoder consists of two parts, an outer and an inner encoder. The outer encoder consists of $|\mathcal{R}_\epsilon|$ multiplications with the matrix $G_M$, each requiring $O(M \log M)$ operations [ ]. Recalling the fact that $|\mathcal{R}_\epsilon| = O(L)$, we conclude that the complexity for the outer encoding is $O(ML \log M)$.

The inner encoder consists of $M$ rounds of the shaper $S_{|\mathcal{R}_\epsilon|,L}$, for which the necessary multiplication with $G_L$ can be done in $O(L \log L)$. To construct $\tilde{U}^L$, first note that by Definition 1 nothing has to be computed for $i \in \mathcal{R}_\epsilon$. For $i \notin \mathcal{R}_\epsilon$, $Z_i$ can be generated using the likelihood ratio

$$L_L^{(i)}(u^{i-1}) := \frac{\Pr[U_i = 0 | U^{i-1} = u^{i-1}]}{\Pr[U_i = 1 | U^{i-1} = u^{i-1}]}, \tag{27}$$

since $Z_i \sim$ Bernoulli$\big(L_L^{(i)}(u^{i-1}) / (L_L^{(i)}(u^{i-1}) + 1)\big)$. All $L_L^{(i)}$ for $i \in [L]$ can be computed recursively with complexity $O(L \log L)$ [1]. Thus, the inner encoding has $O(ML \log L)$ complexity. Combining the inner and outer encoding complexity establishes the claim. ■
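The recursive computation used by the inner encoder can be sketched as follows. The code generates $\tilde{U}^L$ position by position and returns $\tilde{X}^L$ in $O(L \log L)$ operations, working with conditional probabilities $\Pr[U_i = 1 | U^{i-1}]$ instead of likelihood ratios (the two are equivalent). It assumes the kernel [[1, 0], [1, 1]] without bit-reversal and reads Bernoulli($p$) as $\Pr[X = 1] = p$; the function names and the `rng` parameter are hypothetical, and this is an illustration under those assumptions, not the authors' implementation.

```python
import random

def polar_transform(x):
    """u = x G_L over GF(2) for the (assumed) kernel [[1,0],[1,1]]."""
    u = list(x)
    step = 1
    while step < len(u):
        for start in range(0, len(u), 2 * step):
            for j in range(start, start + step):
                u[j] ^= u[j + step]
        step *= 2
    return u

def shaper(uniform_bits, E_K, p, L, rng=random):
    """Sketch of S_{K,L}: returns (u_tilde, x_tilde) with u_tilde = x_tilde G_L.

    E_K is the ordered list of 1-based positions fed by the uniform input.
    """
    position = {i: k for k, i in enumerate(E_K)}

    def choose(index, prob_one):
        if index in position:                  # i in E_K: copy a uniform input bit
            return uniform_bits[position[index]]
        return 1 if rng.random() < prob_one else 0   # Z_i ~ P(U_i | U^{i-1})

    def recurse(prob_one, offset):
        # prob_one[j] = Pr[x_j = 1 | bits decided so far]; returns (u, x), u = x G.
        n = len(prob_one)
        if n == 1:
            bit = choose(offset + 1, prob_one[0])
            return [bit], [bit]
        half = n // 2
        a, b = prob_one[:half], prob_one[half:]
        # The first half of u is the transform of s = x_a XOR x_b.
        minus = [ai * (1 - bi) + (1 - ai) * bi for ai, bi in zip(a, b)]
        u1, s = recurse(minus, offset)
        # The second half of u is the transform of x_b, conditioned on s.
        plus = []
        for ai, bi, sj in zip(a, b, s):
            one = (ai if sj == 0 else 1 - ai) * bi            # x_b = 1, x_a = s XOR 1
            zero = ((1 - ai) if sj == 0 else ai) * (1 - bi)   # x_b = 0, x_a = s
            plus.append(one / (one + zero))
        u2, xb = recurse(plus, offset + half)
        xa = [sj ^ xj for sj, xj in zip(s, xb)]
        return u1 + u2, xa + xb

    return recurse([p] * L, 0)
```

With `E_K = []` every position is drawn from its true conditional distribution, so the output is exactly i.i.d. Bernoulli($p$); with `E_K` covering all positions the map reduces to the deterministic polar transform of the input bits.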
Fig. 5: Encoding circuit for the setup $L = 4$, $M = 2$, $K = 2$, $\mathcal{E}_K = \{1, 4\}$. Here $s^{(i)}_j$ denotes the $j$-th internally-generated bit of the shaper corresponding to the $i$-th super-channel, while $t^{(i)}_j$ is the $j$-th input to the $i$-th encoder at the outer layer. The small gray dots represent variables in the network and correspond to nodes in Fig. 6.



An important feature of the decoder is that the inner layer 
(super-channel) decompressors must be interleaved with the 
outer layer decompressors in order to ensure that all required 
variables are known at the appropriate steps. To illustrate, 
we explain in detail how the decoding is done for the setup 
$L = 4$, $M = 2$, $K = 2$ and $\mathcal{E}_K = \{1, 4\}$. 8 The logical 
structure of the successive cancellation decoder is shown in 
Figure 6. Figure 10 of [ ] depicts a similar representation of 
the original successive cancellation decoder. To see the close 
affinity between the encoding and decoding process, Figure 5 
visualizes the encoder for the setup defined above. 

8 Recall that this implies that we have two compressors at the outer layer 
and two super-channels having a two bit input and a four bit output each. 
The second and third output of both shapers $S_{2,4}$ are randomly distributed 
according to (1) and are assumed to be known at the decoder. 
Each node in Figure 6 is responsible for computing an LR arising during the algorithm; the parameters below each node represent the variables involved in the associated LR computation. Starting from the left we traverse the diagram to the right, at whose border we can compute the LRs. Then we transmit the results back to the left. Here $t^{(i)}_j$ denotes the $j$-th output of the $i$-th decompressor at the outer layer and $s^{(i)}_j$ denotes the $j$-th frozen input for the $i$-th super-channel.

Fig. 6: Logical structure of the successive cancellation decoder for the setup $L = 4$, $M = 2$, $K = 2$, $\mathcal{E}_K = \{1, 4\}$ (compare with [1, Fig. 10]). Note that $t^{(i)}_j$ denotes the $j$-th output of the $i$-th decompressor at the outer layer and $s^{(i)}_j$ denotes the $j$-th internal input to the $i$-th super-channel. The numbering of the nodes represents the order in which they get activated in the decoding process.

The decoding begins by activating node 1, which would like to compute the LR for $T^{(1)}_1$ given $Y^8$. For this it needs the LRs for the first inputs to the two super-channels, and so node 1 activates node 2, which is responsible for computing the LR for the first input to the first super-channel. This computation proceeds exactly as the usual successive cancellation decoder, recursively combining the LRs of the physical channels by calling node 3 and then 6. Assembling their results, node 2 can compute its LR and transmits the result to nodes 1 and 16. Meanwhile, node 1 has also requested the LR of node 9, which performs the same calculation as node 2 for the second super-channel and again forwards the result to nodes 1 and 16. Now node 1 is able to compute the final desired LR and can therefore guess $t^{(1)}_1$. Having that value, node 16 can guess $t^{(1)}_2$, completing the first decompressor of the outer layer.

Node 16 passes control to node 17 in order to compute the LR for $T^{(2)}_1$. This requires the LR for the second inputs to the two super-channels, so nodes 18 (and later 21) are called. Node 18 finishes the decompression of the first super-channel in the usual way, while node 21 completes the decompression of the second super-channel. Neither of these can occur until the first outer layer decompressor is finished. After the inner layer decompression is complete, node 17 can guess $t^{(2)}_1$ and node 24 can finally guess $t^{(2)}_2$, completing the second decompressor of the outer layer. In general, decompression of the $M$ different $k$-th inputs at the inner layer has to wait for the $(k-1)$-th decompressor to finish at the outer layer.

Proposition 2. The decoder has complexity $O(N \log N)$.

Proof: The decoder proceeds by employing, in sequence, the $|\mathcal{R}_\epsilon|$ decompressors for blocklength-$M$ compression of $V_i$ given $Y^L V^{i-1}$. This ensures that at all times the decoder has all the required previous inputs $V^{i-1}$. Each decompressor can be executed using $O(M \log M)$ operations, given the corresponding likelihood ratio (LR) of $V_i | Y^L V^{i-1}$. All such likelihoods can be computed in $O(L \log L)$ steps, and each of the $M$ super-channels requires its own likelihood calculation, as the values taken by $V^{i-1}$ can differ in each case. Using $|\mathcal{R}_\epsilon| = O(L)$, we find that the decompressor has complexity $O(N \log N)$. ■

VII. Derandomization 

Our coding scheme requires randomness at both the inner 
and outer layers. At the inner layer, the shaper randomly 
generates the inputs in $\mathcal{R}_\epsilon^c$, while the values of the frozen 
bits are to be chosen randomly at the outer layer. As the error 
probability of the coding scheme is the average over the pos- 
sible assignments of these random values, at least one choice 
must be as good as the average, meaning a reliable, efficient, 
and deterministic coding scheme must exist. Thinking of the 
random choices as part of the code construction rather than the 
encoder, it follows by the Markov inequality that most choices 
will lead to coding schemes with these properties. Nonetheless, 
it is useful to consider derandomizing the construction, if only 
because randomness can be difficult to generate. 

At the inner layer, the shaper of our coding scheme can be almost completely derandomized while incurring only a negligible overhead in error probability. Specifically, we alter the shaper so that for $i \in \mathcal{D}_\epsilon$, $\tilde{U}_i$ is fixed to the most likely value of the distribution $U_i | U^{i-1}$, while the $Z_i$ corresponding to indices in the leftover set $\mathcal{A}_\epsilon := \mathcal{R}_\epsilon^c \setminus \mathcal{D}_\epsilon$ are generated randomly as before. Since $|\mathcal{A}_\epsilon| = o(L)$, the required rate of randomness vanishes in the limit of large $L$. Nevertheless, the resulting scheme is still reliable; letting $P'_{\mathrm{err}}$ be the error probability of the coding scheme using the modified shaper and $P_{\mathrm{err}}$ as in Theorem 3, we have for $\beta < \frac{1}{2}$

Theorem 4. $P'_{\mathrm{err}} \leq P_{\mathrm{err}} \Big(1 + O\big(L \big(1 - 2^{-2^{-L^\beta}}\big)\big)\Big)$.

For the proof we need the following result 

Lemma 5. Let $R$ be a Bernoulli($p$) distributed random variable with $p \in [\frac{1}{2}, 1]$ such that $H(R) \leq \epsilon$. Then $p \geq 2^{-\epsilon}$.

Proof: Using $p \in [\frac{1}{2}, 1]$ and some basic calculus we find

$$H(R) + \log(p) = (1 - p) \log\Big(\frac{p}{1 - p}\Big) \geq 0. \tag{28}$$

Thus, by the premise, $\epsilon \geq H(R) \geq -\log(p)$. ■
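The inequality of Lemma 5 is easy to check numerically. The sketch below takes $\epsilon = H(R)$, the tightest value allowed by the premise, and verifies $p \geq 2^{-\epsilon}$ on a grid over $[\frac{1}{2}, 1]$. The function names and the small floating-point tolerance are assumptions made for illustration.

```python
import math

def binary_entropy(p):
    """H(R) in bits for R ~ Bernoulli(p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def lemma5_holds(p, tol=1e-12):
    """Check p >= 2^{-eps} with eps = H(R), for p in [1/2, 1]."""
    eps = binary_entropy(p)
    return p >= 2 ** (-eps) - tol

# Grid over [1/2, 1]; equality is attained at both endpoints.
grid_ok = all(lemma5_holds(0.5 + k / 2000) for k in range(1001))
```

Equality holds at $p = \frac{1}{2}$ (where $H(R) = 1$) and at $p = 1$ (where $H(R) = 0$), which is why a tolerance is used in the comparison.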
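Proof of Theorem 4: Let $\bar{u}^L$ denote the most likely sequence according to $P_{U^L}$. Then, by the union bound,

$$P_{\mathrm{err}} \geq P'_{\mathrm{err}} \Pr\big[U^L[\mathcal{D}_\epsilon] = \bar{u}^L[\mathcal{D}_\epsilon]\big] \tag{29}$$
$$\geq P'_{\mathrm{err}} \Big(1 - \sum_{i \in \mathcal{D}_\epsilon} \Pr[U_i \neq \bar{u}_i]\Big). \tag{30}$$

Each term in the summation may be written $\Pr[U_i \neq \bar{u}_i] = \Pr[U_i \neq \bar{u}_i | U^{i-1} = \bar{u}^{i-1}] \Pr[U^{i-1} = \bar{u}^{i-1}]$. But, from the fact that for $i \in \mathcal{D}_\epsilon$, $H(U_i | U^{i-1}) \leq \epsilon$, according to Lemma 5 the conditional probability is upper bounded by $1 - 2^{-\epsilon}$, regardless of the value of $U^{i-1}$. Using the size of $\mathcal{D}_\epsilon$ and form of $\epsilon$ completes the proof. ■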

VIII. Comparison with Gallager's Method 

The main difference between the coding scheme presented in Section III and Gallager's method [ , p.208] is that the shaper $S_{K,L}$ approximates the $L$-dimensional vector $X^L$ with $\tilde{X}^L$, whereas Gallager's shaper $S_G$ approximates the one-dimensional random variable $X$ through $\tilde{X}$. Therefore, the super-channel $W'_{K,L}$ consists of $L$ $W$ channel uses, while $W'_G$ consists of a single $W$ channel use. Note that $N$, as previously defined, denotes the number of physical channel uses.



Fig. 7: Gallager's super-channel $W'_G := W \circ S_G$. A $q$-ary input $U^{\log q}$ (with $q = 2^m$ for $m \in \mathbb{Z}^+$) whose elements are i.i.d. Bernoulli($\frac{1}{2}$) distributed is shaped into a rational approximation to $X$, i.e. $\tilde{X} \sim$ Bernoulli($k/q$) where $k \in \mathbb{Z}^+$ and $k/q \approx p$.



Gallager's method is based on the approximation of $p$ by $k/q$, where $k \in \mathbb{Z}^+$ and $q = 2^m$ for $m \in \mathbb{Z}^+$. A binary channel whose optimal input is Bernoulli($p$) for an irrational $p$ requires, in principle, an infinitely large $q$. The crucial question is how fast $q$ must increase relative to $N$.

It is simple to verify that

$$\delta(\tilde{X}, X) = \min_{k} \Big|p - \frac{k}{q}\Big| \leq \frac{1}{2q}. \tag{31}$$
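The bound (31) can be illustrated with a few lines of code that pick the best dyadic approximation $k/q$ of $p$. The function name and the example value $p = 0.421$ are assumptions chosen for illustration, and $k$ is allowed to range over $0, \dots, q$ here.

```python
def dyadic_approximation(p, m):
    """Best approximation k/q of p with q = 2^m; returns (k, q, |p - k/q|)."""
    q = 2 ** m
    k = min(range(q + 1), key=lambda j: abs(p - j / q))
    return k, q, abs(p - k / q)

# Example: approximate p = 0.421 using m = 4 uniform bits (q = 16).
k, q, distance = dyadic_approximation(0.421, 4)
```

Since the variational distance between Bernoulli($p$) and Bernoulli($k/q$) is $|p - k/q|$, the returned distance is at most $1/(2q)$, and halving it costs one extra input bit per channel use.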



Then the polar coding scheme introduced in [17] 9 can be 
applied to the super-channel W' G ; it has an encoding com- 
plexity of O (log q ■ N log N) and a decoding complexity 
O (qlogq ■ NlogN). Furthermore the probability of error 
behaves as O ^logg • 2 _Ar/i J for /3 < |. Using this scheme 
leads to 

Proposition 3. Gallager's scheme achieves a rate of $C(W) - O\left(\tfrac{1}{q} \log q\right)$
for channels $W$ with an irrational optimal input
distribution.

$^9$ Note that in terms of complexity this scheme improves on the scheme initially
proposed for Gallager's method [2].



TABLE I: Summary of the important parameters for the two
different schemes for $M = L = \sqrt{N}$. Recall that $\beta < \tfrac{1}{2}$.

                     Gallager's scheme                   Our scheme
Rate                 $C - O(\tfrac{1}{q} \log q)$        $C - [\ldots N \ldots]$
Complexity           $O(q \log q \cdot N \log N)$        $O(N \log N)$
Error probability    $O(\log q \cdot 2^{-N^{\beta}})$    $O(\sqrt{N}\, 2^{-N^{\beta/2}})$



Proof: Using (31) and the monotonicity of the variational
distance gives

$$\delta\big((X,Y),(\hat{X},\hat{Y})\big) \le \frac{1}{2q}. \qquad (32)$$

From [ 3, Lemma 2.7], (31) and the monotonicity of the variational
distance we obtain $|H(X) - H(\hat{X})| \le \frac{1}{q} \log(2q)$ and
$|H(Y) - H(\hat{Y})| \le \frac{1}{q} \log(2q)$. The same reasoning applied
to (32) gives $|H(X,Y) - H(\hat{X},\hat{Y})| \le \frac{1}{q} \log(4q)$. Using the
chain rule leads to $|H(Y|X) - H(\hat{Y}|\hat{X})| \le \frac{2}{q} \log q + \frac{3}{q}$. Thus,

$$|I(X : Y) - I(\hat{X} : \hat{Y})| = |H(Y) - H(Y|X) - H(\hat{Y}) + H(\hat{Y}|\hat{X})| \qquad (33)$$

$$\le \frac{3}{q} \log q + \frac{4}{q} = O\left(\frac{1}{q} \log q\right). \qquad (34)$$
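The bound on the loss in mutual information can be checked numerically for a concrete binary-input, binary-output channel. The sketch below is only an illustration under stated assumptions: the channel (a Z-channel with hypothetical transition probability 0.7), the input parameter $p = 1/\pi$, and all function names are chosen for the example, and logarithms are taken to base 2.

```python
import math

def h2(x):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def mutual_information(p, a, b):
    """I(X:Y) in bits for X ~ Bernoulli(p) over a binary-output channel
    with a = Pr[Y=1 | X=0] and b = Pr[Y=1 | X=1]."""
    p_y1 = (1 - p) * a + p * b
    return h2(p_y1) - ((1 - p) * h2(a) + p * h2(b))

def gap_and_bound(p, a, b, m):
    """Loss in mutual information when p is replaced by the nearest k/q,
    together with the bound (3/q) log q + 4/q, for q = 2**m."""
    q = 2 ** m
    k = min(max(round(p * q), 0), q)
    gap = abs(mutual_information(p, a, b) - mutual_information(k / q, a, b))
    bound = 3 * math.log2(q) / q + 4 / q
    return gap, bound

# Hypothetical example channel: X=0 is always received as 0, X=1 as 1 w.p. 0.7.
results = [gap_and_bound(1 / math.pi, 0.0, 0.7, m) for m in range(1, 17)]
```

In such experiments the observed gap is typically far below the bound, which is a worst-case statement over all binary-output channels.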



Table I summarizes the differences between Gallager's
method and the new scheme. The new method has better
complexity but generally worse error probability than
Gallager's method. If $q$ is chosen to increase
slowly (e.g. $q = O(\log N)$), Gallager's scheme works with a
comparable complexity and superior error probability, but its
rate converges much more slowly to the capacity. If, on the
other hand, $q$ is chosen to increase quickly (e.g. $q = O(N)$),
the rates of both schemes converge comparably fast to the
capacity, but the reduced error rate of the Gallager scheme is
offset by its essentially quadratic complexity.
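To make the trade-off for Gallager's scheme concrete, the following sketch evaluates the leading-order expressions from Table I for the two choices of $q$. It is a rough illustration only: the constants hidden by the $O(\cdot)$ notation are ignored, the function name `gallager_orders` is hypothetical, and $q \approx \log N$ is rounded up to a power of two.

```python
import math

def gallager_orders(N, q):
    """Leading-order rate gap (1/q) log q and decoding cost q log q * N log N.

    Constants hidden by the O-notation are ignored; logs are base 2.
    """
    rate_gap = math.log2(q) / q
    decode_ops = q * math.log2(q) * N * math.log2(N)
    return rate_gap, decode_ops

N = 2 ** 20
q_slow = 2 ** math.ceil(math.log2(math.log2(N)))  # power of two near log N
q_fast = N                                        # q growing linearly in N

slow = gallager_orders(N, q_slow)
fast = gallager_orders(N, q_fast)
baseline = N * math.log2(N)  # N log N cost of the new scheme, up to constants

print(f"q = {q_slow}: rate gap ~ {slow[0]:.3g}, cost / (N log N) ~ {slow[1] / baseline:.3g}")
print(f"q = {q_fast}: rate gap ~ {fast[0]:.3g}, cost / (N log N) ~ {fast[1] / baseline:.3g}")
```

With $q = O(N)$ the decoding cost carries an extra factor of $q \log q$ over $N \log N$, which is the essentially quadratic behaviour mentioned above.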

IX. Discussion 

We have used the polarization phenomenon to construct a 
distribution shaper and shown how it can be concatenated 
with a version of polar channel codes to yield a coding 
scheme which achieves the capacity of any binary-input DMC. 
For DMCs with arbitrary input sizes, we can again employ 
multilevel coding. 

A. Possible Modifications 

Several modifications to our coding scheme are possible. In 
principle, neither layer need be based on polar codes, and other 
randomness extractors and coding schemes which are in some 
way advantageous could equally well be used. For instance, the
"invertible extractors" of [18] may prove suitable (provided 
such invertible extractors can be used for shaping). However, 
designing outer layer codes and decoding them efficiently may 
prove challenging, as the properties of the super-channel may 
be difficult to determine. One simple modification to the outer 
layer, concatenation with Reed-Solomon codes, can lead to an 
improved error rate at the outer layer with almost no cost in
computational complexity [19]. 

Within the realm of polar codes, one could use q-ary codes 
for the outer layer [2], [17], instead of multilevel coding. 
Similarly, q-ary polar source coding could be used to design 
shapers for channels with non-binary input [ ]. Following the 
analysis of Section VIII, it can be verified that using a
$2^K$-ary polar code at the outer layer leads to a worse complexity
($O(2^L LM \log M)$ as opposed to $O(LM \log M)$), while the
error probability remains the same (namely $O(L\, 2^{-M^{\beta}})$ for
$\beta < \tfrac{1}{2}$).

At the outer layer, $O(LM)$ bits of randomness are nominally
needed to determine the frozen inputs. However, as the
capacity of the super-channel is presumably achieved by a 
uniform input (or non-uniform inputs add only o(L) terms to 
the mutual information), perhaps it is possible to show that it 
is indeed a symmetric channel (or at least approximately so), 
so that all choices of frozen bits are equivalent, enabling a 
deterministic choice [ , Section VI]. 

B. Applications 

It would be interesting to adapt the method presented here 
to other settings. In the realm of binary discrete memoryless 
channels, the shaping gap — the penalty in lost capacity for 
working with a uniform input distribution instead of the 
optimal one — never exceeds 6% [20], so our method is of 
limited practical utility for binary channels. However, the 
shaping gap can be arbitrarily large in other scenarios, e.g. 
input letters of differing duration [21], channels with power 
constraints on the input symbols [ ], and multi-user channels 
with cross-talk [23]. 

One possible application for the new scheme is the m- 
user MAC, where the new method might be used to achieve 
rate regions with non-uniform inputs [24], [25]. Our method 
should also be applicable to the construction of quantum polar 
codes [26], [27]. Perhaps most interesting is the benefit our 
scheme brings to the AWGN channel with an average power 
constraint, which we discuss in more detail in the remainder 
of this section. 

The capacity of the AWGN channel, with inputs constrained 
to a finite average power, can in principle be achieved by 
discretizing the inputs and employing codes for DMCs. Polar 
codes offer an efficient, capacity-achieving scheme, as de- 
scribed in [ ]. Our coding scheme improves on that method. 
Let $v \ge 0$ and $Z \sim \mathcal{N}(0, v)$. For $m \in \mathbb{Z}^+$ we define

$$C_{m,1} := \sup_{\mathbb{E}[X^2] \le 1,\ |\mathrm{supp}(P_X)| \le 2^m} I(X : X + Z) \qquad (35)$$

$$C_{m,2} := \sup_{\mathbb{E}[X^2] \le 1,\ X \text{ is } m\text{-dyadic}} I(X : X + Z). \qquad (36)$$

These are the respective capacities for coding with
power-constrained, but otherwise arbitrary constellations of $2^m$
discrete points or power-constrained constellations described by
an $m$-dyadic discrete random variable $X$, whose probability
distribution has the form $P_X(x) = k\, 2^{-m}$ for $k \in \mathbb{Z}^+$ and
$x \in \mathrm{supp}(P_X)$. In the limit of large $m$, both quantities approach the
true capacity of the AWGN channel, $C := \tfrac{1}{2} \log(1 + \mathrm{SNR})$,
whose optimal input distribution is simply $X \sim \mathcal{N}(0, 1)$.



The convergence rate of $C_{m,2}$ is exponential in $m$,

$$C - C_{m,2} < \mathrm{SNR}\, 2^{-m}, \qquad (37)$$

and this rate is shown to be achievable with polar codes in [17].
Using our new coding scheme we can relax the constraint of
$X$ being $m$-dyadic to $|\mathrm{supp}(P_X)| \le 2^m$ and thus we can
achieve $C_{m,1}$ using codes with the same complexity. Indeed,
the benefit of the improved approximation $C_{m,1}$ can be large:
According to [ , Theorem 8], using a Gauss quadrature
constellation leads to a doubly exponential convergence rate,

(SNR \ 
TTsnrJ • (38) 
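To illustrate why relaxing the $m$-dyadic constraint matters, the sketch below builds a four-point constellation ($m = 2$) and checks its properties. It assumes that the Gauss quadrature constellation refers to the Gauss-Hermite rule for the standard normal weight, whose four nodes are the roots of $x^4 - 6x^2 + 3$; the variable names are illustrative.

```python
import math

m = 2  # constellation with 2**m = 4 points
s6 = math.sqrt(6.0)

# Nodes: roots of the degree-4 (probabilists') Hermite polynomial x^4 - 6x^2 + 3.
inner, outer = math.sqrt(3 - s6), math.sqrt(3 + s6)
points = [-outer, -inner, inner, outer]

# Normalized quadrature weights, interpreted as input probabilities.
probs = [(3 - s6) / 12, (3 + s6) / 12, (3 + s6) / 12, (3 - s6) / 12]

total = sum(probs)                                          # should be 1
power = sum(pr * x * x for x, pr in zip(points, probs))     # E[X^2]
fourth = sum(pr * x ** 4 for x, pr in zip(points, probs))   # E[X^4]

# An m-dyadic distribution would have every probability a multiple of 2**-m.
is_m_dyadic = all(abs(pr * 2 ** m - round(pr * 2 ** m)) < 1e-9 for pr in probs)
```

The constellation meets the unit power constraint and matches the second and fourth moments of $\mathcal{N}(0, 1)$, yet its probabilities are not multiples of $2^{-m}$, so it is admissible for $C_{m,1}$ but not for $C_{m,2}$.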